AN EFFICIENT IMPLEMENTATION OF N-POINT DCT, N-POINT IDCT,
SA-DCT AND SA-IDCT ALGORITHMS

FIELD

This invention relates to an implementation of algorithms for
multimedia compression and decompression, more particularly, efficient
implementation of n-point discrete cosine transform, n-point inverse discrete
cosine transform, shape adaptive discrete cosine transform, and shape
adaptive inverse discrete cosine transform algorithms using SIMD operations,
MMX™ instructions, VLSI implementation, single processor
implementation or vector processing.

BACKGROUND

Computer multimedia applications typically involve the processing of
high volumes of data values representing audio signals and video images.
Processing the multimedia data often includes performing transform coding,
which is a method of converting the data values into a series of transform
coefficients for more efficient transmission, computation, encoding,
compression, or other processing algorithms.

More specifically, the multimedia data values often represent a signal
as a function of time. Transform coefficients represent the same signal as a
function, for example, of frequency. There are numerous transform
algorithms, including the fast Fourier transform (FFT), the discrete cosine
transform (DCT), and the Z transform. Corresponding inverse transform
algorithms, such as an inverse discrete cosine transform (IDCT), convert
transform coefficients to sample data values. Many of these algorithms
include multiple mathematical steps that involve decimal numbers.

In an effort to allow for easy interchange of graphical formats, the
International Standards Organization (ISO) and the Consultative Committee
for International Telegraph and Telephone (CCITT) formed the Joint
Photographic Experts Group (JPEG) and the Moving Pictures Expert Group
(MPEG). The JPEG and MPEG committees published compression standards that
use the Discrete Cosine Transform (DCT) algorithm to convert a graphics
image to the frequency domain. Efficient implementations of the DCT
algorithm are of interest since JPEG and MPEG algorithms strive to achieve
real-time performance. Most multimedia development software that uses
this type of compression depends on the use of a coprocessor to perform
the compression.

DCT is widely used in one dimensional (1D) and two dimensional (2D)
signal processing. 2D 8x8 DCT is the basis for JPEG and MPEG compression.
While there are presently algorithms that directly compute 2D 8x8 DCT,
taking the 8-point 1D transform of the rows and the columns is equivalent to
the 2D 8x8 transform. There exist algorithms that compute 1D 8-point DCT
which can be used in the row-column method to perform a 2D 8x8 DCT.

BRIEF DESCRIPTION OF THE DRAWINGS

Additional advantages of the invention will become apparent upon 
reading the following detailed description and upon reference to the 
drawings, in which: 

Fig. 1 is a block diagram depicting multimedia compression and 
decompression, in another embodiment of the present invention; 

Fig. 2 depicts a bounding box and macroblocks of an arbitrary shaped 
video object in another embodiment of the present invention; 

Figs. 3a - 3e depict the SA-DCT baseline algorithm for coding an
arbitrarily shaped image segment contained within an 8x8 block, in another
embodiment of the present invention;

Fig. 4 depicts one embodiment of video compression; 

Fig. 5 depicts one embodiment of video decompression; 

Fig. 6 depicts one embodiment of SA-DCT; 

Fig. 7 depicts one embodiment of SA-IDCT; 

Fig. 8 depicts one embodiment of Single Instruction-Multiple-Data 
(SIMD); 

Fig. 9a depicts one embodiment of n-point DCT/IDCT; 

Fig. 9b depicts a factored embodiment of n-point DCT/IDCT. 

DETAILED DESCRIPTION

Exemplary embodiments are described with reference to specific
configurations. Those skilled in the art will appreciate that various changes
and modifications can be made while remaining within the scope of the
claims.

Multimedia extension (MMX™) is used to implement SIMD 
operations. Existing algorithms do not reduce the clock cycle count of the 
implementation in MMX™ although they minimize the number of addition 
and multiplication operations. In another embodiment, the PMADDWD 
instruction used in existing algorithms multiplies and adds, making it 
unworkable to obtain four discrete 32-bit values from four sets of 16-bit 
multiplies. The present invention reduces processor time by having 
operations done with minimal PMADDWD instructions. 

The invention provides an efficient implementation of n-point 
discrete cosine transform, n-point inverse discrete cosine transform, shape 
adaptive discrete cosine transform (SA-DCT) and shape adaptive inverse 
discrete cosine transform (SA-IDCT) algorithms for multimedia compression 
and decompression optimization. An n-point DCT function is represented by 
a first equation having an input matrix, an output matrix and a matrix of 
predetermined values. An n-point IDCT function is represented by a second 
equation having an input matrix, an output matrix and a matrix of 
predetermined values. The multiplication operations within the matrix of 
predetermined values are paired, thereby reducing processor instructions. In 
another embodiment, SIMD operations are used to perform the algorithms.
In another embodiment, MMX operations, being one type of SIMD operation,
are used to perform the algorithms. In another embodiment, vector processing
is used to perform the algorithms. In another embodiment, single processor
implementation is used to perform the algorithms. In yet another
embodiment, VLSI implementation is used to perform the algorithms.

In an embodiment, a machine-readable storage medium is provided having
executable instructions which, when executed by a processor, implement
n-point discrete cosine transform (n-point DCT) algorithms, n-point inverse
discrete cosine transform (n-point IDCT) algorithms, shape adaptive discrete
cosine transform (SA-DCT) algorithms and shape adaptive inverse discrete
cosine transform (SA-IDCT) algorithms for multimedia compression and
decompression. A machine-readable storage medium includes
any mechanism that provides (i.e., stores and/or transmits) information in a
form readable by a machine (e.g., a computer). For example, a
machine-readable medium includes read only memory (ROM); random access memory
(RAM); magnetic disk storage media; optical storage media; flash memory
devices; electrical, optical, acoustical or other form of propagated signals (e.g.,
carrier waves, infrared signals, digital signals, etc.); etc.

In another embodiment, once the video signal has been stored as data 
in the computer system memory, the data is manipulated at compression 
stage 6, which may include compressing the data into a smaller memory 
space. In FIG. 1, at stage 6, by occupying a smaller memory space, the video 
signal is more easily stored or transmitted because there is less data to store or 
transmit, requiring less processing power and system resources. Video signal 
16, stored in memory registers of the computer system, is directed to 
compression stage 6. In the spatial domain, video signal 16 is represented by a 
waveform in which the amplitude of the signal is indicated by vertical
displacement while time or space is indicated by horizontal displacement. 

For many compression methods it is desirable to transform a signal 
from the spatial domain to another domain, such as the frequency domain, 
before analyzing or modifying the signal. After video signal 16 is received at 
compression stage 6, the signal is transformed from the spatial domain to the 
frequency domain. In the frequency domain, the amplitude of a particular 
frequency component (e.g. a sine or cosine wave) of the original signal is 
indicated by vertical displacement while the frequency of each frequency 
component of the original signal is indicated by horizontal displacement. The 
video waveform 16 is illustrated in the frequency domain at step 18 within 
compression stage 6. 

In another embodiment, transformation of a signal from the spatial to 
the frequency domain involves performing a Discrete Cosine Transform of 
the data elements representing the signal. For example, in accordance with 
some JPEG and MPEG standards, square subregions of the video image, 
generally an 8 x 8 array of pixels, are transformed from the spatial domain to 
the frequency domain using a discrete cosine transform function. This 8x8 
array of pixels corresponds to 8x8 data elements, each data element 
corresponding to the value (e.g. color, brightness, etc.) of its associated pixel in 
the 8x8 array. For another embodiment, other transform functions are 
implemented such as, for example, a Fourier transform, a fast Fourier 
transform, a fast Hartley transform, or a wavelet transform. 

In another embodiment of the present invention, the signal is
reconverted back into the spatial domain by applying an inverse transform to 
the data. Alternatively, the signal remains in the frequency domain and is 
transformed back into the spatial domain during the decompression stage, as 
described below. 

Upon receiving the compressed video signal at receiving stage 10, the 
data associated with the signal is loaded into computer system memory. In 
addition, if the video signal is encrypted, it is decrypted here. At 
decompression stage 12, the signal is decompressed by a method including, for 
example, applying an inverse transform to the data to translate the signal back 
into the spatial domain. This assumes the signal has been transmitted in a 
compressed format in the frequency domain from computer system 24. For an 
embodiment in which the compressed video signal is transmitted in the 
spatial domain, application of an inverse transform during the 
decompression stage may not be necessary. However, decompression of a 
video signal may be more easily accomplished in the frequency domain, 
requiring a spatial domain signal received by decompression stage 12 to be 
transformed into the frequency domain for decompression, then back into the 
spatial domain for display. 

Once decompressed, the signal is transferred to display stage 14, which 
may comprise a video RAM (VRAM) array, and the image is displayed on 
display device 30. Using this technique, a user at computer system 24 can 
transmit a video image to computer system 26 for viewing at the second 
computer terminal. Similarly, computer system 26 may have similar video 
and audio transmission capabilities (not shown), allowing display and audio
playback on display device 28 and speakers 32, respectively, of computer 
system 24. In this manner, applications such as video conferencing are 
enabled. 

As shown in Figure 4, SA-DCT can be used in one embodiment of 
video compression devices 490. Motion estimation 410 and motion 
compensation 420 can remove the temporal redundancy in the pictures. SA- 
DCT 430 can remove the spatial redundancy by transforming "time-domain" 
information into "frequency-domain" information with help from 
Quantization 440. Variable Length Encoder (VLC) 450 compresses the 
frequency-domain data into bits. Inverse Quantization 460, SA-IDCT 470, and 
motion compensation 480 are used to improve the encoding quality. 

As shown in Figure 5, SA-IDCT can be used in one embodiment of 
video decompression devices 560. VLD 510 and Inverse Quantization 520 
reverse bits into frequency-domain data. SA-IDCT 470 reverses the frequency- 
domain data into spatial domain data. Motion compensation 540 reconstructs 
the images 550 and 570. 

As shown in Figure 6, n-point DCT can be used in one embodiment of 
SA-DCT 430. First, the data is shifted in the vertical direction 432 (Figure 3b). 
Second, n-point DCT is performed column by column 434 (Figure 3c). The 
data is shifted in the horizontal direction 436 (Figure 3d). As shown in figure 
3e, n-point DCT is performed row by row 438. 

As shown in Figure 7, n-point IDCT can be used in one embodiment of
SA-IDCT 470. N-point IDCT is performed row by row 472. Data is shifted in
the horizontal direction 474. N-point IDCT is performed column by column
476. The data is shifted in the vertical direction 478.

As shown in Figure 8, SIMD uses a single instruction to operate on
multiple data. 64-bit data 820 contains 16-bit data 822, 824, 826, and 828. 64-bit
data 840 contains 16-bit data 842, 844, 846, and 848. PMADDWD (one of
the MMX instructions, which are one type of SIMD instruction) is used to add
the multiplication result of 822 and 832 to the multiplication result of 824 and
834, as well as to add the multiplication result of 826 and 836 to the
multiplication result of 828 and 838.
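
The multiply-and-add behavior described above can be modeled in portable C. The following is a minimal sketch for illustration only, not the MMX instruction itself; the function name pmaddwd_model and the array layout (index 0 holding the least significant 16-bit word) are assumptions introduced here.

```c
#include <stdint.h>

/* Scalar model of PMADDWD: multiplies four pairs of signed 16-bit
   words and adds adjacent products, giving two signed 32-bit sums.
   Index 0 is the least significant word of the 64-bit operand.
   The single wraparound case of the real instruction (all four
   inputs of a pair equal to -32768) is not modeled. */
static void pmaddwd_model(const int16_t a[4], const int16_t b[4],
                          int32_t out[2])
{
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}
```

Each 32-bit result is a sum of exactly two products, which is why the listings later in this description arrange their constants so that every output needs products that can be grouped in pairs.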

As shown in Figure 9, matrix multiplication can be used for n-point
DCT/IDCT 434, 438, 472, and 476. For n-point DCT, input [X] is the
time-domain data 930 and output [Y] is the frequency-domain data 910. In one
embodiment, matrix [A] is factored into [S][M][B], where the number of
multiplications is reduced. One embodiment of this invention uses SIMD
operations for n-point DCT/IDCT. As applied to DCT, matrix 910 represents
frequency domain data, and matrix 930 represents time domain data. As
applied to IDCT, matrix 910 represents time domain data, and matrix 930
represents frequency domain data.

JPEG lossy compression algorithms operate in three successive stages:
DCT transformation, coefficient quantization, and lossless compression. DCT
belongs to a class of mathematical operations that includes the Fast Fourier
Transform (FFT). The basic operation performed by these transforms is to
convert a signal from one type of representation to another. DCT is used for
compression and IDCT is used for decompression. During compression, DCT
transforms a set of points from the spatial domain into a representation in
the frequency domain. During decompression, an IDCT function converts the
spectral representation of the signal back to a spatial one. The formulas for
the DCT and IDCT are shown in Table 1 and Table 2, respectively.

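$$\mathrm{DCT}(i,j) = \frac{1}{\sqrt{2N}}\, C(i)\, C(j) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} \mathrm{Pixel}(x,y)\, \cos\left[\frac{(2x+1)\, i\, \pi}{2N}\right] \cos\left[\frac{(2y+1)\, j\, \pi}{2N}\right]$$

$$C(x) = \frac{1}{\sqrt{2}} \text{ if } x \text{ is } 0, \text{ else } 1 \text{ if } x > 0$$

Table 1

$$\mathrm{Pixel}(x,y) = \frac{1}{\sqrt{2N}} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} C(i)\, C(j)\, \mathrm{DCT}(i,j)\, \cos\left[\frac{(2x+1)\, i\, \pi}{2N}\right] \cos\left[\frac{(2y+1)\, j\, \pi}{2N}\right]$$

$$C(x) = \frac{1}{\sqrt{2}} \text{ if } x \text{ is } 0, \text{ else } 1 \text{ if } x > 0$$

Table 2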



One embodiment of the DCT algorithm is performed on an N x N
square matrix of pixel values, and it yields an N x N square matrix of
frequency coefficients. DCT performs a matrix multiplication of the input
pixel data matrix by the transposed cosine transform matrix and stores the
result in a temporary N by N matrix. The temporary matrix is multiplied by
the cosine transform matrix, and the result is stored in the output matrix.

042390.P8657 Express Mail No. EM522828778US

The DCT computation complexity is simplified by factoring out the 
transformation matrix into butterfly and shuffle matrices. The butterfly and 
shuffle matrices can be computed with fast integer addition, the resulting 
zeroes in the original matrix being trivial to compute. In most of the fast DCT 
algorithms, optimization usually focuses on reducing the number of DCT 
arithmetic operations, especially the number of multiplications.

IDCT essentially uses the reverse of the operations performed in the 
DCT. In one embodiment, the DCT values in the N by N matrix are 
multiplied by the cosine transform matrix. The result of this transformation 
is stored in a temporary N by N matrix. This matrix is then multiplied by the 
transposed cosine transform matrix. The result of this multiplication is 
stored in the output block of pixels. 

The MPEG-4 video coding standard supports arbitrary-shape video
objects in addition to the conventional frame-based functionalities in MPEG-1
and MPEG-2. Thus, in MPEG-4, the video input is no longer considered as a
rectangular region. One of the building blocks for MPEG-4 video coding
standard version 2 is the shape-adaptive-DCT (SA-DCT) for arbitrary shape
objects. In an MPEG-4 image, there are contour macroblocks which contain
the shape edge of an object, as shown in Fig. 2. Instead of performing an 8x8
DCT after filling the non-object pixels, the new standard adaptively performs
N-point DCT based on the shape. For contour macroblocks, only object pixels
are transformed into DCT domain. The procedure of transforming only
object pixels into DCT domain is called shape-adaptive DCT. In one
embodiment, this invention optimizes SA-DCT and SA-IDCT for MPEG-4
object based coding scheme using platform-dependent knowledge. Compared
to 8x8-DCT, SA-DCT provides a significantly better rate-distortion trade-off,
especially at high bit rates.

Standard 8x8 DCT is applied to 8x8 blocks with 64 opaque pixels. In 8x8
blocks that straddle the boundaries of a VOP, standard DCT is replaced by
shape adaptive DCT (SA DCT). These boundary blocks are arbitrarily shaped,
with at least one transparent pixel, so that the number of opaque pels is less
than 64.

Similar to standard DCT, forward and inverse SA DCT convert 
pixel(x,y) to DCT(i,j) and vice versa. SA DCT also keeps all conditions on the 
internal precision of floating point arithmetic as well as the rounding to 
integers and the dynamic ranges of pixel(x,y) and DCT(i,j) stated in 8x8 DCT. 
In contrast to standard 8x8 DCT, the internal processing of SA DCT is 
controlled by shape parameters, which are derived from the decoded VOP 
shape. Only the opaque pixels within the boundary blocks are transformed
and coded. As a consequence, SA DCT does not require the padding
technique, if shape coding is lossless, and the number of achieved SA DCT 
coefficients is identical to the number of opaque pixels in the given boundary 
block. 

Figures 3a-3e depict the SA-DCT baseline algorithm for coding an
arbitrarily shaped image segment contained within an 8x8-block. The SA-DCT
algorithm is based on predefined orthonormal sets of DCT basis
functions. The forward 2D SA-DCT first applies 1D DCT transformation to
columns, and then to rows. The inverse 2D SA-DCT applies the 1D IDCT
transform first to rows, then to columns. Figure 3a depicts an image block
segmented into two regions, foreground as shown in gray and background as
shown lighter. To perform the vertical transform of the foreground, the
length (vector size N, 0<N<9) of each column j (0<j<9) of the foreground
segment is calculated. As depicted in Figure 3b, the columns are shifted and
aligned to the upper border of the 8x8 reference block.

Depending on the vector size N of each particular segment
column, a 1D n-point DCT, a transform kernel containing a set of N basis
vectors DCT-n, is selected for each particular column and applied to the first
N column pixels. For example, as depicted in Figure 3b, the rightmost
column is transformed using 3-point DCT. As depicted in Figure 3d, before
the horizontal DCT transformation, the rows are shifted to the left border of
the 8x8 reference block. Figure 3e depicts the final location of the resulting
DCT coefficients within an 8x8-image block.

The final number of DCT coefficients is identical to the number of
pixels contained in the image segment. Additionally, the coefficients are
located in comparable positions as in a standard 8x8 block. The DC coefficient
is located in the upper left border of the reference block and is dependent on
the actual shape of the segment. The remaining coefficients are concentrated
around the DC coefficient. Since the contour of the segment is transmitted to
the receiver prior to transmitting the macroblock information, the decoder
performs the shape-adapted inverse DCT as the reverse operation in both
horizontal and vertical segment direction on the basis of decoded shape data.
1D N-point DCT is accomplished by the following equation:



$$Y_n = c_n \sum_{k=0}^{N-1} x_k \cos\left(\frac{n(2k+1)\pi}{2N}\right)$$

where $c_0 = \sqrt{1/N}$ and $c_n = \sqrt{2/N}$ for $n = 1, \ldots, N-1$

Table 3



The computation of the 2-point DCT can be simplified as follows: 

Table 4

In conventional algorithmic optimization, the number of additions and
multiplications is minimized. Thus,

$$z_0 = x_0 + x_1, \qquad z_1 = x_0 - x_1$$

$$Y_0 = \frac{1}{\sqrt{2}}\, z_0, \qquad Y_1 = \frac{1}{\sqrt{2}}\, z_1$$

Table 5

In this way, only two additions and two multiplications are performed, 
instead of two additions and four multiplications. The following C code for 
this algorithm is currently used. 

void fsadct2_float (float in[2], float out[2])
{
    static float f0 = 0.707107f;    /* 1/sqrt(2) */

    out[0] = (in[0] + in[1]) * f0;
    out[1] = (in[0] - in[1]) * f0;
}

In one embodiment of the invention, using MMX™ and Streaming 
SIMD Extensions, two additions and four multiplications can be performed 
quickly with only one PMADDWD instruction for the 2-point DCT as follows: 

void fsadct2_mmx (short in[2], short out[2])
{
    static __int64 xstatic1 = 0xA57E5A825A825A82;   // -f0 f0 f0 f0
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        mov ecx, out
        movd mm0, [eax]                 // mm0 = xx, xx, i1, i0
        pshufw mm1, mm0, 01000100b      // mm1 = i1, i0, i1, i0
        pmaddwd mm1, xstatic1           // mm1 = i0*f0 - i1*f0, i0*f0 + i1*f0
        paddd mm1, rounding             // do proper rounding
        psrad mm1, 15
        packssdw mm1, mm7               // mm1 = x, x, o1, o0
        movd [ecx], mm1
    }
}
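
The fixed-point arithmetic of the 2-point MMX listing can be followed more easily in scalar C. The sketch below is an illustrative model only: the names sat16 and fsadct2_fixed are hypothetical, and it assumes that the compiler implements right shift of negative values as an arithmetic shift, as PSRAD does.

```c
#include <stdint.h>

/* Saturates a 32-bit value to the signed 16-bit range, as PACKSSDW does. */
static int16_t sat16(int32_t v)
{
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Scalar model of the Q15 arithmetic in the 2-point MMX listing.
   F0 is 1/sqrt(2) scaled by 2^15 (0x5A82). Adding 0x4000 before the
   shift by 15 rounds the result to the nearest integer. */
static void fsadct2_fixed(const int16_t in[2], int16_t out[2])
{
    const int32_t F0 = 0x5A82;
    const int32_t ROUND = 0x4000;
    int32_t s0 = in[0] * F0 + in[1] * F0;   /* low PMADDWD result  */
    int32_t s1 = in[0] * F0 - in[1] * F0;   /* high PMADDWD result */
    out[0] = sat16((s0 + ROUND) >> 15);
    out[1] = sat16((s1 + ROUND) >> 15);
}
```

Each of s0 and s1 is a sum of two products, so both outputs correspond to the two 32-bit halves of a single PMADDWD result.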



The computational complexity of the 3-point DCT can be simplified as
follows:

Table 6

In one embodiment of the invention the following is the MMX™ code
implementation of the 3 point DCT algorithm:

void fsadct3_mmx (short in[3], short out[3])
{
    static __int64 xconst1 = 0x0000000049E749E7;    // 0 0 f0 f0
    static __int64 xconst2 = 0x5A82A57E977D3441;    // f1 -f1 -f3 f2
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movd mm0, [eax]                 // 0 0 i1 i0
        movd mm5, [eax+2]               // 0 0 i2 i1
        movq mm7, rounding

        pshufw mm4, mm0, 00111100b      // i0 0 0 i0
        pshufw mm3, mm5, 11010001b      // 0 i2 i1 i2
        paddsw mm4, mm3                 // i0 i2 i1 i0+i2
        movq mm1, mm4
        pmaddwd mm4, xconst1            // 0   (i0+i1+i2)*f0
                                        // o0 << 15
        pmaddwd mm1, xconst2            // (i0-i2)*f1   (i0+i2)*f2 - i1*f3
                                        // o1 << 15   o2 << 15

        paddd mm1, mm7                  // do proper rounding
        paddd mm4, mm7                  // do proper rounding
        psrad mm1, 15                   // o1 o2
        psrad mm4, 15                   // o0
        packssdw mm1, mm7               // x x o1 o2
        mov eax, out
        pshufw mm2, mm1, 11110001b      // x x o2 o1
        packssdw mm4, mm7               // x x x o0
        movd [eax], mm4                 // save o0
        movd [eax+2], mm2               // save o1, o2
    }
}



The 4-point DCT can be computed as shown in Table 7. Multiplication 
operations can be paired (or grouped) within the matrix. 

[Table 7: matrix equation expressing the outputs Y0, Y1, Y2 and Y3 of the 4-point DCT in terms of the inputs]

Table 7



The above 4-point DCT can be further written as: 

Table 8

Even if the upper left cosine block in the original matrix is further
factored, leaving two multiplication operations, two PMADDWD operations
would still be needed, plus a substantial amount of additional instructions to
shuffle and add the results. Many existing algorithms do not reduce the clock
cycle count of the implementation in MMX™ although they minimize the
number of addition and multiplication operations. To reduce processor time
by having operations done with minimal SIMD operations (e.g.,
PMADDWD), the above 4-point DCT can be further written as:

Table 9



In one embodiment of the invention, the following is the MMX™
implementation of the 4 point DCT algorithm:

void fsadct4_mmx (short in[4], short out[4])
{
    static __int64 xstatic1 = 0x4000C00040004000;   // f0 -f0 f0 f0
    static __int64 xstatic2 = 0xDD5D539F22A3539F;   // -f1 f2 f1 f2
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        mov ecx, out
        movq mm0, [eax]                 // i3 i2 i1 i0
        pshufw mm1, mm0, 00011011b      // i0 i1 i2 i3
        movq mm2, mm1
        paddsw mm2, mm0                 // b0 b1 b1 b0
        psubsw mm0, mm1                 // -b3 -b2 b2 b3
        pmaddwd mm2, xstatic1           // o2 << 15   o0 << 15
        pmaddwd mm0, xstatic2           // o3 << 15   o1 << 15
        paddd mm2, rounding             // do proper rounding
        paddd mm0, rounding             // do proper rounding
        psrad mm2, 15
        psrad mm0, 15
        packssdw mm2, mm0               // o3 o1 o2 o0
        pshufw mm3, mm2, 11011000b      // o3 o2 o1 o0
        movq [ecx], mm3
    }
}



The following matrix definitions are presented for illustrative purposes 
to define, or name, specific matrices in tables 7, 8 and 9. The values within 
the matrices defined below represent one embodiment of the invention. 



As shown in Table 8, Shuffle Matrix [S] =

    1 0 0 0
    0 0 1 0
    0 1 0 0
    0 0 0 1

Matrix [A] is the matrix shown in Table 7.

Multiplication Matrix [M] is the matrix shown in Table 9.

As shown in Table 8, Butterfly Matrix [B] =

    1 0  0  1
    0 1  1  0
    0 1 -1  0
    1 0  0 -1
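
The factorization of the 4-point DCT into butterfly, multiplication and shuffle stages can be traced in scalar C. The following is a sketch under stated assumptions, not the claimed implementation: the Q15 constants are taken from the 4-point MMX listing, while the names sat16 and fsadct4_factored are hypothetical. The 16-bit saturation of PADDSW and PSUBSW is not modeled, and arithmetic right shift of negative values is assumed.

```c
#include <stdint.h>

/* Q15 constants taken from the 4-point MMX listing. */
enum { F0 = 0x4000, F1 = 0x22A3, F2 = 0x539F, ROUND = 0x4000 };

/* Saturates a 32-bit value to the signed 16-bit range, as PACKSSDW does. */
static int16_t sat16(int32_t v)
{
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

static void fsadct4_factored(const int16_t x[4], int16_t y[4])
{
    /* Butterfly [B]: additions and subtractions only. */
    int32_t b0 = x[0] + x[3];
    int32_t b1 = x[1] + x[2];
    int32_t b2 = x[1] - x[2];
    int32_t b3 = x[0] - x[3];

    /* Multiplication [M]: every result is a sum of two products,
       which is the shape that one half of a PMADDWD computes. */
    int32_t m0 = b0 * F0 + b1 * F0;     /* Y0 */
    int32_t m1 = b0 * F0 - b1 * F0;     /* Y2 */
    int32_t m2 = b3 * F2 + b2 * F1;     /* Y1 */
    int32_t m3 = b3 * F1 - b2 * F2;     /* Y3 */

    /* Shuffle [S]: swap the two middle results into natural order. */
    y[0] = sat16((m0 + ROUND) >> 15);
    y[1] = sat16((m2 + ROUND) >> 15);
    y[2] = sat16((m1 + ROUND) >> 15);
    y[3] = sat16((m3 + ROUND) >> 15);
}
```

The four sums of paired products map onto two PMADDWD instructions, one for (m0, m1) and one for (m2, m3).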

Group 1 and Group 2 as shown below are presented for illustrative
purposes to define a part of matrix [M] in Table 8. The values below represent
one embodiment of the invention. Group 1 and Group 2 are "paired" or
"grouped". That is, the multiplication operations within matrix [M] of
predetermined values are paired, thereby reducing processor instructions.



Group 1 =

    1/√2
    1/√2



Group 2 =

     1/√2
    -1/√2

The method described above can be provided in applications (e.g.,
video applications) to potentially increase the performance of the applications
by decreasing the time to perform n-point DCT, n-point IDCT, SA-DCT, and
SA-IDCT over known techniques. In one embodiment, the MMX™ versions
of the n-point DCTs performed from 1.3 to 3.0 times faster than fixed-point
versions. In one embodiment in which a complete and optimized
implementation of SA-DCT/SA-IDCT on Intel processors is demonstrated,
the speed of the SA-DCT/SA-IDCT process is increased by a factor of 1.1 to 1.5.

Also compared in Table 10 is the performance of an MMX™ 8x8
DCT/IDCT embodiment.

[Table 10: time (seconds after 10 million iterations) and increase in speed when using MMX™ for the DCT and IDCT routines; of the table body, only a speed increase of 1.35 is legible]

Table 10



Having disclosed exemplary embodiments, modifications and 
variations may be made to the disclosed embodiments while remaining 
within the spirit and scope of the invention as defined by the appended 
claims. 

The following code, shown in the Appendix, represents one
embodiment of the invention to implement the 5-point DCT, 6-point DCT,
7-point DCT, and 8-point DCT and the 2-point IDCT, 3-point IDCT, 4-point
IDCT, 5-point IDCT, 6-point IDCT, 7-point IDCT, and 8-point IDCT algorithms.



Appendix 



void fsadct5_mmx (short in[5], short out[5])
{
    static __int64 xconst1 = 0x2F954CFEE6FC417E;    // f2 f1 -f4 f3
    static __int64 xconst2 = 0xB3022F95BE821904;    // -f1 f2 -f3 f4
    static __int64 xconst3 = 0x393E393E000050F4;    // f0 f0 0 f5
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movq mm0, [eax]                     // i3 i2 i1 i0
        pshufw mm1, [eax+2], 01011011b      // i2 i2 i3 i4
        movq mm2, mm0                       // i3 i2 i1 i0
        movq mm6, mm1                       // i2 i2 i3 i4
        paddsw mm0, mm1                     // x x b2 b0
        psubsw mm2, mm1                     // x x b3 b1
        punpcklwd mm0, mm2                  // b3 b2 b1 b0
        pshufw mm1, mm0, 11011000b          // b3 b1 b2 b0
        movq mm2, mm1
        pmaddwd mm1, xconst1                // (b1*f1+b3*f2) << 15   (b0*f3-b2*f4) << 15
                                            // o1   o2 + b4
                                            // SAVE
        movq mm5, rounding
        pmaddwd mm2, xconst2                // (b1*f2-b3*f1) << 15   (b0*f4-b2*f3) << 15
                                            // o3   o4 - b4
                                            // SAVE
        pshufw mm4, mm0, 00001000b          // x x b2 b0
        psllq mm4, 32                       // b2 b0 0 0
        psrlq mm6, 48                       // 0 0 0 i2
        pshufw mm3, mm6, 11001100b          // 0 i2 0 i2
        paddsw mm4, mm3                     // b2 b0+i2 0 i2
        pmaddwd mm4, xconst3                // (b2+b0+i2)*f0   i2*f5
                                            // (o0) << 15   (b4) << 15
        movq mm7, mm4
        paddd mm7, mm5                      // do proper rounding
        psrad mm7, 15                       // o0 x
        packssdw mm7, mm0                   // x x o0 x
        psrlq mm7, 16
        mov eax, out
        movd [eax], mm7                     // store o0
        psllq mm4, 32
        psrlq mm4, 32                       // 0 (b4 << 15)
        psubd mm1, mm4                      // (o1 << 15) (o2 << 15)
        paddd mm2, mm4                      // (o3 << 15) (o4 << 15)
        paddd mm1, mm5                      // do proper rounding
        paddd mm2, mm5                      // do proper rounding
        psrad mm1, 15                       // x o1 x o2
        psrad mm2, 15                       // x o3 x o4
        packssdw mm1, mm2                   // o3 o4 o1 o2
        pshufw mm0, mm1, 177                // o4 o3 o2 o1
        movq [eax+2], mm0
    }
}



void fsadct6_mmx (short in[6], short out[6])
{
    static __int64 xconst1 = 0xCBBF344134413441;    // -f0 f0 f0 f0
    static __int64 xconst2 = 0xB61924F34000C000;    // -f3 f2 f1 -f1
    static __int64 xconst3 = 0x0000132034414762;    // 0 f4 f0 f5
    static __int64 xconst4 = 0x00004762CBBF1320;    // 0 f5 -f0 f4
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movq mm0, [eax]                     // i3 i2 i1 i0
        movq mm1, [eax+4]                   // i5 i4 i3 i2
        xor eax, eax
        movq mm7, rounding
        pshufw mm2, mm1, 01011011b          // i3 i3 i4 i5
        movq mm1, mm0
        paddsw mm0, mm2
        pinsrw mm0, eax, 3                  // mm0 = 0 b2 b1 b0
        psubsw mm1, mm2                     // mm1 = 0 b5 b4 b3

        pshufw mm6, mm0, 11111110b          // 0 0 0 b2
        paddsw mm6, mm0                     // 0 x b1 b0+b2
        pshufw mm3, mm1, 11111011b          // 0 0 b5 0
        paddsw mm3, mm1                     // 0 x b4+b5 b3
        psllq mm3, 32
        pshufw mm2, mm6, 11110100b
        paddsw mm2, mm3                     // b4+b5 b3 b1 b0+b2
        pmaddwd mm2, xconst1                // mm2 = o3 << 15   o0 << 15

        pshufw mm3, mm0, 01100010b          // b1 b2 b0 b2
        pshufw mm4, mm0, 11001111b          // 0 b0 0 0
        paddsw mm3, mm4                     // b1 b0+b2 b0 b2
        pmaddwd mm3, xconst2                // mm3 = o4 << 15   o2 << 15

        movq mm4, mm1
        pmaddwd mm1, xconst3                // b5*f4 << 15   (b4*f0 + b3*f5) << 15
        pmaddwd mm4, xconst4                // b5*f5 << 15   (-b4*f0 + b3*f4) << 15
        pshufw mm5, mm1, 00001110b          // x x b5*f4 << 15
        pshufw mm6, mm4, 00001110b          // x x b5*f5 << 15
        paddd mm5, mm1                      // mm5 = x   o1 << 15
        paddd mm6, mm4                      // mm6 = x   o5 << 15

        paddd mm2, mm7                      // do proper rounding
        paddd mm3, mm7
        paddd mm5, mm7
        paddd mm6, mm7
        psrad mm2, 15                       // x o3 x o0
        psrad mm3, 15                       // x o4 x o2
        psrad mm5, 15
        psrad mm6, 15

        mov eax, out
        packssdw mm3, mm2                   // o3 o0 o4 o2
        pshufw mm1, mm3, 01110010b          // o4 o3 o2 o0
        movq mm2, mm1
        punpcklwd mm1, mm5                  // x x o1 o0
        movd [eax], mm1                     // store o0, o1
        psrlq mm2, 16                       // 0 o4 o3 o2
        psllq mm6, 48                       // o5 0 0 0
        paddsw mm2, mm6                     // o5 o4 o3 o2
        movq [eax+4], mm2                   // store o2, o3, o4, o5
    }
}



void fsadct7_mmx (short in[7], short out[7])
{
    static int f0_7 = 0x3061;
    static int f1_7 = 0x42B4;
    static int f2_7 = 0x3DA5;
    static int f3_7 = 0x357E;
    static int f4_7 = 0x2AA9;
    static int f5_7 = 0x1DB0;
    static int f6_7 = 0x0F39;
    static int f7_7 = 0x446B;
    static int b[7];

    static __int64 xconst1 = 0x357E42B40F393DA5;    // f3 f1 f6 f2
    static __int64 xconst2 = 0xE250357EC25B2AA9;    // -f5 f3 -f2 f4
    static __int64 xconst3 = 0xBD4C1DB0D5570F39;    // -f1 f5 -f4 f6
    static __int64 xconst4 = 0x0000446B30613061;    // 0 f7 f0 f0
    static __int64 xconst5 = 0x00001DB00000D557;    // 0 f5 0 -f4
    static __int64 xconst6 = 0x0000BD4C0000F0C7;    // 0 -f1 0 -f6
    static __int64 xconst7 = 0x0000357E00003DA5;    // 0 f3 0 f2
    static __int64 rounding = 0x0000400000004000;
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10 



15 



_asm { 

mov eax, in 

movqmmO, [eax] // i3 i2 il iO 

movq mm2, [eax+6] / / i6 i5 i4 i3 

pextrw ecx, mmO, 3 / / i3 
pshufw mml, 1111112, 00011011b // i3 i4 i5 i6 
movq rnm2, mmO 



paddsw mmO, mml 
psubsw mrr^, mml 
movq mml, mmO 
punpcklwd mmO, mm2 
punpckhwd mml, mm2 



// x b4b2b0 
//0 b5b3bl 

//b3 b2 bl bO 
//Ox b5b4 



pshufw mm5, mml, 11011100b // 0 b5 0b4 
pshufw mm4,mm0,11011000b //b3blb2b0 



20 



TQ5 



Ji30 



movq mml, mm4 
movq mm2, mm4 
movq mm3 r mm4 

pmaddwd mml, xconstl 
pmaddwd mm2 / xconst2 
pmaddwd mm3, xconst3 



//b3 bl b2 bO 
//b3 bl b2 bO 
//b3 bl b2 bO 

// (bl*fl+b3*f3) « 15 (b0*f2+b2*f6) « 15 
// (M*f3-b3*f5) « 15 (b0*f4-b2*f2) « 15 
// (bl*f5-b3*fl) « 15 (b0*f6-b2*f4) « 15 



movq mm6, mm5 

pinsrw mm6, ecx, 2 //0i3 0b4 

paddsw mm0,mm6 // x b2+i3 x b4+b0 

pshufw mm6, mmO, 00001000b / / x x b2+i3 b4+b0 
movq mmO, rounding 



pinsrw mm6, ecx, 2 
pmaddwd mm6, xconst4 
movq mmi, mm6 



//xi3 b0+b2 b4+i3 
//b6«15 o0«15 



= 35 



40 



45 



50 



55 



60 



mov ecx, out 
paddd mm6, mmO 
psrad mm6, 15 
packssdw mm6, mml 
movd dword ptr [ecx], mm6 

movqmm6, mrrS 
movq mm7, mm5 
pmaddwd mm5, xconst5 
pmaddwd mm6, xconst6 
pmaddwd mm7, xconst7 



/ / do proper rounding 

/ / x x x oO 
// save oO 

//xb5xb4 
// xb5xb4 

// (+b5*f5) « 15 (-b4*f4) « 15 
// (-b5*fl) « 15 (-b4*f6) « 15 
// (+b5*f3) « 15 (+b4*f2) « 15 



paddd mml, mm5 
paddd mm2, mm6 
paddd mm3, mm7 
psrlq mm4, 32 

psubd mml, mm4 
paddd mm2, mm4 
psuMmm3,mm4 
paddd mml, mmO 
paddd mml, mmO 
paddd mrn3, mmO 

psrad mml, 15 
psrad mm2, 15 
psrad mm3, 15 



//ol«15 o2«15+b6 
//o3«15 o4«15-b6 
// o5 « 15 06 « 15 + b6 



// ol «15 o2«15 
// o3«15 o4«15 
//o5«15 o6«15 
/ / do proper rounding 
//do proper rounding 
/ / do proper rounding 

// ol o2 
// o3 o4 
// o5 06 



packssdw rnrn2, mml 



// ol o2 o3 o4 
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pshufw mml, mm2, 27 // o4 o3 o2 ol 

movq [ecx+2], mml / / save ol, o2, o3, o4 

packssdw mm3, mm7 //xxo5o6 

pshufw mml, rnrn3, 00000001b 

movd dword ptr [ecx+10], mml / / save o5, 06 



void fsadct8_mmx (short in[8], short out[8])
{
    static __int64 xconst1 = 0xA57E5A825A825A82; // -f0 f0 f0 f0
    static __int64 xconst2 = 0xD2BF2D412D412D41; // -f4 f4 f4 f4
    static __int64 xconst3 = 0xC4E0187D187D3B20; // -f2 f6 f6 f2
    static __int64 xconst4 = 0x3536238E0C7C3EC5; // f3 f5 f7 f1
    static __int64 xconst5 = 0xDC723536C13B0C7C; // -f5 f3 -f1 f7
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movq mm0, [eax]                 // i3 i2 i1 i0
        movq mm1, [eax+8]               // i7 i6 i5 i4
        pshufw mm2, mm1, 00011011b      // i4 i5 i6 i7
        movq mm7, rounding
        movq mm1, mm0
        paddsw mm0, mm2                 // mm0 = b[3] b[2] b[1] b[0]  (A*)
        psubsw mm1, mm2                 // mm1 = b[4] b[5] b[6] b[7]  (A*)
        pshufw mm2, mm0, 00001011b      // x x b[2] b[3]
        movq mm3, mm0
        psubsw mm3, mm2                 // mm3 = x x b2[2] b2[3]  (B*)
        paddsw mm2, mm0                 // mm2 = x x b2[1] b2[0]  (B*)
        pshufw mm4, mm1, 00001100b      // mm4 = x x b2[4] b2[7]  (B*)
        pshufw mm0, mm1, 10011001b      // b[5] b[6] b[5] b[6]
        pmaddwd mm0, xconst1            // b2[5] << 15  b2[6] << 15
        paddd mm0, mm7                  // do proper rounding
        psrad mm0, 15                   // b2[5] b2[6]
        packssdw mm0, mm1               // mm0 = x x b2[5] b2[6]  (B*)
        pshufw mm5, mm2, 01000100b      // b2[1] b2[0] b2[1] b2[0]
        pmaddwd mm5, xconst2            // o4 << 15  o0 << 15
        pshufw mm2, mm3, 01000100b      // b2[2] b2[3] b2[2] b2[3]
        pmaddwd mm2, xconst3            // o6 << 15  o2 << 15
        paddd mm5, mm7                  // do proper rounding
        paddd mm2, mm7                  // do proper rounding
        psrad mm5, 15
        psrad mm2, 15
        packssdw mm5, mm2               // mm5 = o6 o2 o4 o0  (Y*)
        movq mm1, mm4
        paddsw mm4, mm0                 // x x b3[4] b3[7]
        psubsw mm1, mm0                 // x x b3[5] b3[6]
        punpcklwd mm4, mm1              // b3[5] b3[4] b3[6] b3[7]
        pshufw mm3, mm4, 11011000b      // mm3 = b3[5] b3[6] b3[4] b3[7]  (C*)
        movq mm4, mm3
        pmaddwd mm3, xconst4            // o5 << 15  o1 << 15
        pmaddwd mm4, xconst5            // o3 << 15  o7 << 15
        paddd mm3, mm7                  // do proper rounding
        paddd mm4, mm7                  // do proper rounding
        psrad mm3, 15
        psrad mm4, 15
        packssdw mm3, mm4               // mm3 = o3 o7 o5 o1  (Y*)
        pshufw mm0, mm5, 11011000b      // mm0 = o6 o4 o2 o0
        pshufw mm1, mm3, 10011100b      // mm1 = o7 o5 o3 o1
        movq mm2, mm0
        punpcklwd mm2, mm1              // o3 o2 o1 o0
        punpckhwd mm0, mm1              // o7 o6 o5 o4
        mov eax, out
        movq [eax], mm2
        movq [eax+8], mm0
    }
}



//////////////////////////////////////////// 
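For readers who prefer scalar code, the first stage of the 8-point forward routine (the two instructions whose comments are marked A*) can be sketched in plain C. This is only an illustrative model inferred from the register comments in the listing; the names `sat16` and `butterfly8_stage_a` are hypothetical and do not appear in the original code.

```c
#include <stdint.h>

/* Saturate a 32-bit value to the signed 16-bit range, as PADDSW and
   PSUBSW do for each word. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical scalar model of the A* butterfly stage:
   b[k]     = in[k] + in[7-k]   (paddsw mm0, mm2)
   b[7 - k] = in[k] - in[7-k]   (psubsw mm1, mm2)
   for k = 0..3, following the register comments in the listing. */
static void butterfly8_stage_a(const int16_t in[8], int16_t b[8])
{
    for (int k = 0; k < 4; ++k) {
        b[k]     = sat16((int32_t)in[k] + (int32_t)in[7 - k]);
        b[7 - k] = sat16((int32_t)in[k] - (int32_t)in[7 - k]);
    }
}
```

The MMX version computes all four sums with one instruction and all four differences with another; the loop above performs the same eight operations one at a time.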

void fsaidct2_mmx (short in[2], short out[2])
{
    static __int64 xconst1 = 0xA57E5A825A825A82; // -f0 f0 f0 f0
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        mov ecx, out
        movd mm0, [eax]
        pshufw mm1, mm0, 01000100b      // i1 i0 i1 i0
        pmaddwd mm1, xconst1            // o1 << 15  o0 << 15
        paddd mm1, rounding             // do proper rounding
        psrad mm1, 15
        packssdw mm1, mm7               // x x o1 o0
        movd [ecx], mm1
    }
}
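The fixed-point arithmetic of the 2-point routine can be followed more easily in scalar form. The sketch below is a hypothetical C model, not part of the original listing: it takes f0 = 0x5A82 from xconst1 and the rounding term 0x4000 from the `rounding` constant, and it assumes that PSRAD behaves as a floor division by 2^15. The helper names `floor_shift15`, `sat16` and `idct2_scalar` are introduced here for illustration only.

```c
#include <stdint.h>

#define F0_Q15    0x5A82   /* f0 as it appears in xconst1 */
#define ROUND_Q15 0x4000   /* one half in Q15, from the rounding constant */

/* Arithmetic right shift by 15 (floor division by 32768), written so it
   does not rely on implementation-defined shifts of negative values. */
static int32_t floor_shift15(int32_t v)
{
    return (v >= 0) ? (v >> 15) : -((-v + 32767) >> 15);
}

/* Saturate to 16 bits, as PACKSSDW does. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model: PMADDWD forms i0*f0 + i1*f0 and i0*f0 - i1*f0,
   then rounding is added and the result is shifted back by 15 bits. */
static void idct2_scalar(const int16_t in[2], int16_t out[2])
{
    int32_t sum  = (int32_t)in[0] * F0_Q15 + (int32_t)in[1] * F0_Q15;
    int32_t diff = (int32_t)in[0] * F0_Q15 - (int32_t)in[1] * F0_Q15;
    out[0] = sat16(floor_shift15(sum + ROUND_Q15));
    out[1] = sat16(floor_shift15(diff + ROUND_Q15));
}
```

For example, the inputs {100, 50} give the outputs {106, 35}, which are 150 and 50 scaled by roughly 0.7071 and rounded to the nearest integer.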



void fsaidct3_mmx (short in[3], short out[3])
{
    static __int64 xconst1 = 0x49E7977D49E73441; // f0 -f3 f0 f2
    static __int64 xconst2 = 0x0000A57E00005A82; // 0 -f1 0 f1
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movd mm0, [eax]                 // 0 0 i1 i0
        movd mm1, [eax+2]               // 0 0 i2 i1
        mov eax, out
        movq mm7, rounding
        psllq mm0, 32
        paddd mm0, mm1                  // i1 i0 i2 i1
        pshufw mm1, mm0, 10011001b      // i0 i2 i0 i2
        pmaddwd mm1, xconst1            // o1 << 15  b2 << 15
        pshufw mm2, mm0, 11111111b      // i1 i1 i1 i1
        pmaddwd mm2, xconst2            // -b1 << 15  b1 << 15
        pshufw mm3, mm1, 01000100b      // b2 << 15  b2 << 15
        paddd mm2, mm3                  // o2 << 15  o0 << 15
        paddd mm1, mm7                  // do proper rounding
        paddd mm2, mm7
        psrad mm1, 15                   // x o1 x x
        psrad mm2, 15                   // x o2 x o0
        movd [eax], mm2                 // store o0
        packssdw mm1, mm2               // o2 x o1 x
        pshufw mm2, mm1, 11111101b      // x x o2 o1
        movd [eax+2], mm2               // store o1, o2
    }
}

void fsaidct4_mmx (short in[4], short out[4])
{
    static __int64 xconst1 = 0xC000400040004000; // -f0 f0 f0 f0
    static __int64 xconst2 = 0xAC6122A322A3539F; // -f2 f1 f1 f2
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movq mm7, rounding
        movq mm0, [eax]                 // i3 i2 i1 i0
        mov eax, out
        pshufw mm1, mm0, 10001000b      // i2 i0 i2 i0
        pmaddwd mm1, xconst1            // mm1 = b[1] << 15  b[0] << 15
        pshufw mm2, mm0, 11011101b      // i3 i1 i3 i1
        pmaddwd mm2, xconst2            // mm2 = b[2] << 15  b[3] << 15
        movq mm3, mm1
        paddd mm1, mm2                  // o1 << 15  o0 << 15
        psubd mm3, mm2                  // o2 << 15  o3 << 15
        paddd mm1, mm7                  // do proper rounding
        paddd mm3, mm7
        psrad mm1, 15
        psrad mm3, 15
        packssdw mm1, mm3               // o2 o3 o1 o0
        pshufw mm2, mm1, 10110100b      // o3 o2 o1 o0
        movq [eax], mm2
    }
}



void fsaidct5_mmx (short in[5], short out[5])
{
    static __int64 xconst1 = 0xB3022F952F954CFE; // -f1 f2 f2 f1
    static __int64 xconst2 = 0xBE82E6FC1904417E; // -f3 -f4 f4 f3
    static __int64 xconst3 = 0x0000393E0000393E; // 0 f0 0 f0
    static __int64 xconst4 = 0x0000000050F4AF0C; // 0 0 f5 -f5
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movq mm0, [eax+2]               // mm0 = i4 i3 i2 i1
        movd mm6, [eax]                 // x x i1 i0
        movq mm7, rounding
        pshufw mm1, mm0, 10001000b      // i3 i1 i3 i1
        pmaddwd mm1, xconst1            // mm1 = b2 << 15  b1 << 15
        pshufw mm2, mm0, 11011101b      // i4 i2 i4 i2
        movq mm5, mm2
        pmaddwd mm2, xconst2            // b4 << 15  b3 << 15
        pshufw mm3, mm6, 00000000b      // i0 i0 i0 i0
        pmaddwd mm3, xconst3            // mm3 = b0 << 15  b0 << 15
        paddd mm2, mm3                  // mm2 = b4 << 15  b3 << 15
        movq mm4, mm2
        paddd mm4, mm1                  // o1 << 15  o0 << 15
        psubd mm2, mm1                  // o3 << 15  o4 << 15
        pmaddwd mm5, xconst4            // x  (i4 - i2) * f5
        paddd mm5, mm3                  // x  o2 << 15
        mov eax, out
        paddd mm4, mm7                  // do proper rounding
        paddd mm2, mm7
        paddd mm5, mm7
        psrad mm4, 15
        psrad mm2, 15
        psrad mm5, 15
        packssdw mm4, mm2               // o3 o4 o1 o0
        packssdw mm5, mm7               // x x x o2
        movd [eax], mm4                 // store o0, o1
        pshufw mm3, mm4, 00001011b      // x x o4 o3
        movd [eax+4], mm5               // store o2
        movd [eax+6], mm3               // store o3, o4
    }
}



void fsaidct6_mmx (short in[6], short out[6])
{
    static __int64 xconst1 = 0xC000344140003441; // -f1 f0 f1 f0
    static __int64 xconst2 = 0x000024F33441B619; // 0 f2 f0 -f3
    static __int64 xconst3 = 0x4762132013204762; // f5 f4 f4 f5
    static __int64 xconst4 = 0x344100003441CBBF; // f0 0 f0 -f0
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movq mm0, [eax]                 // i3 i2 i1 i0
        movd mm1, [eax+8]               // 0 0 i5 i4
        punpcklwd mm1, mm0              // i1 i5 i0 i4
        pshufw mm2, mm0, 10001000b      // i2 i0 i2 i0
        pmaddwd mm2, xconst1            // b0-b1 << 15  b0+b1 << 15
        pshufw mm3, mm1, 00000100b      // i4 i4 i0 i4
        pmaddwd mm3, xconst2            // mm3 = (b2) << 15  b4 << 15
        pshufw mm4, mm3, 11101110b      // b2 << 15  b2 << 15
        paddd mm2, mm4                  // mm2 = b5 << 15  b3 << 15
        pshufw mm4, mm1, 11101110b      // i1 i5 i1 i5
        mov eax, out
        pmaddwd mm4, xconst3            // i1*f5+i5*f4 << 15  i1*f4+i5*f5 << 15
        pextrw ecx, mm1, 2
        movq mm7, rounding
        pinsrw mm0, ecx, 0              // i3 i2 i1 i5
        pmaddwd mm0, xconst4            // b2 << 15  (i1-i5)*f0 << 15
        pshufw mm1, mm0, 11101110b      // b2 << 15  b2 << 15
        movq mm6, mm1
        paddd mm1, mm4                  // b0 << 15  x
        psubd mm4, mm6                  // x  b2 << 15
        psubd mm0, mm6                  // mm0 = x  b1 << 15
        psrlq mm1, 32
        psllq mm4, 32
        paddd mm1, mm4                  // mm1 = b2 << 15  b0 << 15
        movq mm5, mm1
        movq mm6, mm0
        paddd mm1, mm2                  // o2 << 15  o0 << 15
        psubd mm2, mm5                  // o3 << 15  o5 << 15
        paddd mm0, mm3                  // x  o1 << 15
        psubd mm3, mm6                  // x  o4 << 15
        paddd mm0, mm7                  // do proper rounding
        paddd mm1, mm7
        paddd mm2, mm7
        paddd mm3, mm7
        psrad mm0, 15
        psrad mm1, 15
        psrad mm2, 15
        psrad mm3, 15
        packssdw mm1, mm0               // x o1 o2 o0
        packssdw mm2, mm3               // x o4 o3 o5
        psllq mm1, 16                   // o1 o2 o0 0
        pshufw mm4, mm2, 01010010b      // o3 o3 o5 o4
        pextrw ecx, mm4, 3
        pinsrw mm1, ecx, 0              // o1 o2 o0 o3
        pshufw mm3, mm1, 00101101b      // o3 o2 o1 o0
        movq [eax], mm3                 // save o0, o1, o2, o3
        movd [eax+8], mm4               // save o4, o5
    }
}

void fsaidct7_mmx (short in[7], short out[7])
{
    static __int64 xconst1 = 0x357EE25042B4357E; // f3 -f5 f1 f3
    static __int64 xconst2 = 0x0F39C25B3DA52AA9; // f6 -f2 f2 f4
    static __int64 xconst3 = 0x1DB0BD4CD557F0C7; // f5 -f1 -f4 -f6
    static __int64 xconst4 = 0x0000BD4C00001DB0; // 0 -f1 0 f5
    static __int64 xconst5 = 0x3061D55730610F39; // f0 -f4 f0 f6
    static __int64 xconst6 = 0x0000357E30613DA5; // 0 f3 f0 f2
    static __int64 xconst7 = 0x446BBB953061BB95; // f7 -f7 f0 -f7
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movq mm0, [eax+2]               // i4 i3 i2 i1
        pshufw mm6, mm0, 00100010b      // i1 i3 i1 i3
        pmaddwd mm6, xconst1            // i1*f3-i3*f5  i1*f1+i3*f3
                                        //    (b2)         (b1)
        pshufw mm5, mm0, 01110111b      // i2 i4 i2 i4
        pmaddwd mm5, xconst2            // i2*f6-i4*f2  i2*f2+i4*f4
                                        //    (b5)         (b4)
        pshufw mm4, mm0, 00100111b      // i1 i3 i2 i4
        pmaddwd mm4, xconst3            // i1*f5-i3*f1  -i2*f4-i4*f6
                                        //    (b3)         (b6)

        mov ecx, [eax]                  // i1 i0
        movd mm7, [eax+10]              // 0 0 i6 i5
        mov eax, out
        pinsrw mm7, ecx, 3              // i0 0 i6 i5
        pshufw mm3, mm7, 00000000b      // i5 i5 i5 i5
        pmaddwd mm3, xconst4
        paddd mm3, mm6                  // mm3 = b2 << 15  b1 << 15
        pshufw mm2, mm7, 11011101b      // i0 i6 i0 i6
        pmaddwd mm2, xconst5
        paddd mm2, mm5                  // mm2 = b5 << 15  b4 << 15
        pshufw mm1, mm7, 00001101b      // i5 i5 i0 i6
        pmaddwd mm1, xconst6
        paddd mm1, mm4                  // mm1 = b3 << 15  b6 << 15
        pshufw mm4, mm7, 10101101b      // 0 0 i0 i6
        movq mm7, rounding
        pshufw mm5, mm0, 11111101b      // x x i4 i2
        psllq mm5, 32
        paddd mm4, mm5                  // i4 i2 i0 i6
        pmaddwd mm4, xconst7
        pshufw mm5, mm4, 00001110b      // x x i4 i2
        paddd mm4, mm5                  // x  o3 << 15

        movq mm5, mm2
        paddd mm5, mm3                  // mm5 = o1 << 15  o0 << 15
        psubd mm2, mm3                  // mm2 = o5 << 15  o6 << 15
        pshufw mm3, mm1, 00001110b      // x  b3 << 15
        movq mm6, mm1
        paddd mm1, mm3                  // x  o2 << 15
        psubd mm6, mm3                  // mm6 = x  o4 << 15

        psllq mm4, 32
        psllq mm1, 32
        psrlq mm1, 32
        paddd mm1, mm4                  // mm1 = o3 << 15  o2 << 15
        paddd mm1, mm7                  // do proper rounding
        paddd mm2, mm7
        paddd mm5, mm7
        paddd mm6, mm7
        psrad mm1, 15
        psrad mm2, 15
        psrad mm5, 15
        psrad mm6, 15
        packssdw mm5, mm1               // o3 o2 o1 o0
        packssdw mm2, mm6               // x o4 o5 o6
        pshufw mm3, mm2, 00010010b      // o6 o5 x o4

        movq [eax], mm5                 // store o0, o1, o2, o3
        movd [eax+8], mm3               // store o4
        psrlq mm3, 32
        movd [eax+10], mm3              // store o5, o6
    }
}



void fsaidct8_mmx (short in[8], short out[8])
{
    static __int64 xconst1 = 0x0C7C3EC5C13B0C7C; // f7 f1 -f1 f7
    static __int64 xconst2 = 0x238E35363536DC72; // f5 f3 f3 -f5
    static __int64 xconst3 = 0x2D41D2BF2D412D41; // f4 -f4 f4 f4
    static __int64 xconst4 = 0x3B20187D187DC4E0; // f2 f6 f6 -f2
    static __int64 xconst5 = 0x00005A8200005A82; // 0 f0 0 f0
    static __int64 rounding = 0x0000400000004000;

    __asm {
        mov eax, in
        movq mm0, [eax]                 // i3 i2 i1 i0
        movq mm1, [eax+8]               // i7 i6 i5 i4
        mov eax, out
        pshufw mm2, mm0, 10001101b      // i2 i0 i3 i1
        pshufw mm3, mm1, 00100111b      // i4 i6 i5 i7
        movq mm1, mm2
        punpcklwd mm2, mm3              // i5 i3 i7 i1
        punpckhwd mm1, mm3              // i4 i2 i6 i0

        pshufw mm3, mm2, 01000100b      // i7 i1 i7 i1
        pmaddwd mm3, xconst1            // b3[1] b3[0]
        pshufw mm4, mm2, 11101110b      // i5 i3 i5 i3
        pmaddwd mm4, xconst2            // b3[3] b3[2]

        pshufw mm0, mm1, 00110011b      // i0 i4 i0 i4
        pmaddwd mm0, xconst3            // mm0 = b1 b0  (*A)
        pshufw mm2, mm1, 10011001b      // i2 i6 i2 i6
        pmaddwd mm2, xconst4
        pshufw mm5, mm2, 01001110b      // mm5 = b2 b3  (*A)

        movq mm1, mm3
        psubd mm3, mm4                  // b6 << 15  b5 << 15
        movq mm7, mm3
        punpckhdq mm7, mm6              // x  b6 << 15
        movq mm2, mm7
        paddd mm7, mm3                  // x  b[6]+b[5]
        psubd mm2, mm3                  // x  b[6]-b[5]
        punpckldq mm2, mm7              // b6+b5  b6-b5
        movq mm7, rounding
        paddd mm2, mm7
        psrad mm2, 15
        paddd mm1, mm4                  // b2[7] b2[4]

        movq mm4, mm0
        paddd mm0, mm5                  // mm0 = b2[1] b2[0]  (*C)
        psubd mm4, mm5
        pshufw mm6, mm4, 01001110b      // mm6 = b2[3] b2[2]  (*C)
        pmaddwd mm2, xconst5            // b2[6] b2[5]

        movq mm5, mm2
        punpckldq mm2, mm1              // mm2 = b2[4] b2[5]  (*C)
        punpckhdq mm1, mm5              // mm1 = b2[6] b2[7]  (*C)

        movq mm3, mm0
        paddd mm0, mm1                  // o1 << 15  o0 << 15
        psubd mm3, mm1                  // o6 << 15  o7 << 15
        movq mm5, mm6
        paddd mm6, mm2                  // o3 << 15  o2 << 15
        psubd mm5, mm2                  // o4 << 15  o5 << 15

        paddd mm0, mm7                  // do proper rounding
        paddd mm3, mm7
        paddd mm6, mm7
        paddd mm5, mm7
        psrad mm0, 15
        psrad mm3, 15
        psrad mm6, 15
        psrad mm5, 15

        packssdw mm0, mm6               // o3 o2 o1 o0
        packssdw mm3, mm5               // o4 o5 o6 o7
        pshufw mm1, mm3, 00011011b      // o7 o6 o5 o4
        movq [eax], mm0
        movq [eax+8], mm1
    }
}
CLAIMS

What is claimed is: 



1. A method comprising:
    multiplying [A] by [x] to obtain [y];
    wherein said [x] is a matrix of inputs, said [y] is a matrix of outputs,
    and said [A] is a matrix of predetermined values and multiplication
    operations; and
    wherein said multiplication operations within said [A] are paired.

2. The method as in claim 1,
    wherein said matrix [A] is factored into a butterfly matrix [B], a
    shuffle matrix [S], and a multiplication matrix [M]; and
    wherein multiplication operations within said multiplication matrix [M]
    are grouped for simultaneous execution.

3. The method as in claim 1, wherein at least one n-point discrete cosine
transform (DCT) is performed.

4. The method as in claim 3, wherein multimedia compression is performed.

5. The method as in claim 3, wherein at least one shape adaptive discrete
cosine transform (SA-DCT) is performed.

6. The method as in claim 1, wherein at least one n-point inverse discrete
cosine transform (IDCT) is performed.

7. The method as in claim 6, wherein multimedia decompression is performed.

8. The method as in claim 6, wherein at least one SA-IDCT is performed.

9. The method as in claim 1, implemented using single instruction multiple
data (SIMD) operations.

10. The method as in claim 9, implemented using MMX operations.

11. The method as in claim 10, implemented using PMADDWD instructions.

12. The method as in claim 1, implemented using at least one of very large
scale integration (VLSI) implementation, single processor implementation,
and vector processing.

13. A machine readable storage medium having executable instructions which,
when executed by a machine, cause said machine to perform operations
comprising:
    multiplying [A] by [x] to obtain [y];
    wherein said [x] is a matrix of inputs, said [y] is a matrix of outputs,
    and said [A] is a matrix of predetermined values and multiplication
    operations; and
    wherein said multiplication operations within said [A] are paired.

14. The machine readable storage medium as in claim 13,
    wherein said matrix [A] is factored into butterfly matrix [B], shuffle
    matrix [S], and multiplication matrix [M]; and
    wherein multiplication operations within said multiplication matrix [M]
    are grouped for simultaneous execution.

15. The machine readable storage medium as in claim 13, wherein at least
one n-point DCT is performed.

16. The machine readable storage medium as in claim 15, wherein multimedia
compression is performed.

17. The machine readable storage medium as in claim 15, wherein at least
one SA-DCT is performed.

18. The machine readable storage medium as in claim 13, wherein at least
one n-point IDCT is performed.

19. The machine readable storage medium as in claim 18, wherein multimedia
decompression is performed.

20. The machine readable storage medium as in claim 18, wherein at least
one SA-IDCT is performed.

21. The machine readable storage medium as in claim 13, implemented using
SIMD operations.

22. The machine readable storage medium as in claim 21, implemented using
MMX operations.

23. The machine readable storage medium as in claim 22, implemented using
PMADDWD instructions.

24. The machine readable storage medium as in claim 13, implemented using
at least one of VLSI implementation, single processor implementation, and
vector processing.

25. A method comprising performing an n-point DCT or an n-point IDCT
wherein multiplication operations and addition operations within said
n-point DCT and said n-point IDCT are paired.

26. The method as in claim 25, further comprising performing SA-DCT or
SA-IDCT.

27. The method as in claim 25, implemented using instructions that can
execute multiple operations in parallel.

28. The method as in claim 27, said instructions being at least one of
MMX™ operations and Streaming SIMD Extensions.

ABSTRACT

An efficient implementation of n-point discrete cosine transform,
n-point inverse discrete cosine transform, shape adaptive discrete cosine
transform and shape adaptive inverse discrete cosine transform algorithms 
for multimedia compression and decompression optimization. An n-point 
DCT function is represented by a first equation having an input matrix, an 
output matrix and a matrix of predetermined values. An n-point IDCT 
function is represented by a second equation having an input matrix, an 
output matrix and a matrix of predetermined values. The multiplication 
operations within the matrix of predetermined values are paired, thereby 
reducing processor instructions. SIMD operations, MMX operations, VLSI 
implementation, single processor implementation, and vector processing are 
used to perform the algorithms. 
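
As an illustration of what pairing the multiplications means in practice, the following C sketch models a single PMADDWD operation: four 16-bit inputs are multiplied by four 16-bit constants and adjacent products are added, giving two 32-bit results from one instruction. The function name `pmaddwd_model` is hypothetical, and the constant 0x2D41 used in the usage note is the value labelled f4 in the 8-point listing; the sketch is a reading aid under those assumptions, not the patented implementation.

```c
#include <stdint.h>

/* Scalar model of PMADDWD on one 64-bit operand.
   a[0..3] are the four signed 16-bit data words (a[0] is the lowest),
   c[0..3] are the four signed 16-bit constants in the same order.
   Each output is the sum of one pair of products. The 64-bit
   intermediate avoids signed overflow in the model; the final
   narrowing matches the instruction for every input except the single
   case where all four words of a pair equal -32768. */
static void pmaddwd_model(const int16_t a[4], const int16_t c[4],
                          int32_t out[2])
{
    int64_t lo = (int64_t)a[0] * c[0] + (int64_t)a[1] * c[1];
    int64_t hi = (int64_t)a[2] * c[2] + (int64_t)a[3] * c[3];
    out[0] = (int32_t)lo;
    out[1] = (int32_t)hi;
}
```

With data words {10, 6, 10, 6} and constants {0x2D41, 0x2D41, 0x2D41, -0x2D41}, the model returns (10 + 6) * 0x2D41 and (10 - 6) * 0x2D41, which is the sum-and-difference pattern that the listings obtain from a single PMADDWD against a constant laid out as -f4 f4 f4 f4.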